if ('knitr' %in% installed.packages() == FALSE) {
install.packages('knitr', repos = 'http://cran.us.r-project.org')
}
library(knitr)
Assume that you have a table with variables that describe a person: name, age, height, weight, and profession. Identify which variables are discrete, continuous, and categorical. (1 mark)
Person
| Variable | Type |
|---|---|
| name | categorical |
| age | discrete |
| height | continuous |
| weight | continuous |
| profession | categorical |
Assume that you have a table with variables that describe a lecturer: name, gender, subject, semester, and staff number. Identify which variables are ordinal, interval, and ratio. (1 mark)
Lecturer
| Variable | Type |
|---|---|
| name | nominal |
| gender | nominal |
| subject | nominal |
| semester | ordinal |
| staff number | nominal |
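The difference between a nominal and an ordinal variable can be illustrated in R with factors. This is a minimal sketch; the semester and subject values below are hypothetical and are not taken from any dataset in this assignment.

```r
# Hypothetical values, used only to illustrate nominal vs ordinal factors
semester <- factor(c("2", "1", "3", "1"),
                   levels = c("1", "2", "3"), ordered = TRUE)  # ordinal
subject <- factor(c("Statistics", "Databases", "Statistics", "Networks"))  # nominal

is.ordered(semester)        # ordered levels: comparisons are meaningful
semester[1] > semester[2]   # compares "2" with "1" using the level order
is.ordered(subject)         # no order is defined between subjects
```

An ordered factor supports comparisons such as `>` between its levels, while a plain factor only supports equality checks.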
You and a friend wonder if it is “normal” that some bottles of your favourite beer contain more beer than others although the volume is stated as 0.33L. You find out from the manufacturer that the volume of beer in a bottle has a mean of 0.33L and a standard deviation of 0.03. If you now measure the beer volume in the next 100 bottles that you drink with your friend, how many of those 100 bottles are expected to contain more than 0.39L given that the information of the manufacturer is correct? (1 mark)
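To solve this problem we assume that the volume of beer in a bottle is normally distributed. The central limit theorem states that the means of sufficiently large samples are approximately normally distributed even if the population isn’t, but here we are asking about individual bottles, so the normality of the individual volumes is an assumption rather than a consequence of that theorem.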
So we have the given parameters: \[x = 0.39L\ \mbox{individual value}\] \[\mu = 0.33L\ \mbox{mean}\] \[\sigma = 0.03L \ \mbox{Standard deviation}\]
Now we need to calculate the \(z\) score for a normal distribution.
\[ z = \frac{x - \mu}{\sigma}\] Using the previous values: \[ z = \frac{0.39 - 0.33}{0.03} = \frac{0.06}{0.03} = 2\] Now that we have the \(z\) score the next step is to find the probability for this value in the \(z\) score table for normal probabilities.
For \(z = 2\) the table gives a cumulative probability of \(0.9772\). This means that the probability of a bottle containing at most 0.39L is \(0.9772\):
\[\mathcal{P}(X \le 0.39) = 0.9772\] Because the volume is a continuous variable, \(\mathcal{P}(X > 0.39)\) is the complement: \[\mathcal{P}(X > 0.39) = 1 - \mathcal{P}(X \le 0.39)\] \[\mathcal{P}(X > 0.39) = 1 - 0.9772 = 0.0228\] So among the next 100 bottles we expect \(100 \times 0.0228 = 2.28\), that is about 2 bottles, to contain more than 0.39L.
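The same calculation can be checked in R instead of reading a \(z\) table. This is a minimal sketch that assumes, as above, that the bottle volumes are normally distributed; the variable names are chosen for illustration.

```r
mu <- 0.33      # mean volume stated by the manufacturer (L)
sigma <- 0.03   # standard deviation (L)
x <- 0.39       # threshold volume (L)

z <- (x - mu) / sigma                    # z score of the threshold
p_more <- pnorm(z, lower.tail = FALSE)   # upper-tail probability P(X > x)
expected_bottles <- 100 * p_more         # expected count among 100 bottles

print(round(c(z = z, p_more = p_more, expected = expected_bottles), 4))
```

`pnorm` with `lower.tail = FALSE` returns the upper-tail probability directly, which avoids the manual `1 - p` step.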
Use the salary.rds dataset from lecture 1.
Install the following packages: Hmisc, pastecs, psych.
if ('Hmisc' %in% installed.packages() == FALSE) {
install.packages('Hmisc', repos = 'http://cran.us.r-project.org')
}
if ('pastecs' %in% installed.packages() == FALSE) {
install.packages('pastecs', repos = 'http://cran.us.r-project.org')
}
if ('psych' %in% installed.packages() == FALSE) {
install.packages('psych', repos = 'http://cran.us.r-project.org')
}
Describe the data using the installed packages and identify how the descriptions differ between packages.
Hmisc::describe shows a summary per variable: the number of observations, the number of missing and distinct values, the mean, Gini's mean difference (Gmd), selected quantiles including the median, and the lowest and highest values. Variables with few distinct values are presented as a frequency table. The HTML output also presents a small histogram if the variable is numeric.
pastecs::stat.desc shows a table with descriptive statistics only for numerical variables. It presents measures of central tendency and dispersion such as the mean, median, variance, standard deviation, range, min and max. It also shows the standard error of the mean and the confidence interval of the mean.
psych::describe shows a table with the number of valid observations (missing values are discarded), mean, median, trimmed mean, min value, max value, range, standard deviation and standard error. It reports fewer statistics than the pastecs package but adds the skew and kurtosis of the variables. Variables that are categorical or logical are converted to numerical and marked with a *.
library(Hmisc, warn.conflicts = FALSE)
library(pastecs, warn.conflicts = FALSE)
library(psych, warn.conflicts = FALSE)
salary <- readRDS("data/salary.rds")
description.Hmisc <- Hmisc::describe(salary)
description.pastecs <- pastecs::stat.desc(salary)
description.psych <- psych::describe(salary)
html(description.Hmisc)
gender
| n | missing | distinct |
|---|---|---|
| 52 | 0 | 2 |

| Value | Female | Male |
|---|---|---|
| Frequency | 14 | 38 |
| Proportion | 0.269 | 0.731 |

rank
| n | missing | distinct |
|---|---|---|
| 52 | 0 | 3 |

| Value | Assistant | Associate | Full |
|---|---|---|---|
| Frequency | 18 | 14 | 20 |
| Proportion | 0.346 | 0.269 | 0.385 |

yr
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 52 | 0 | 18 | 0.995 | 7.481 | 6.174 | 0.55 | 1.00 | 3.00 | 7.00 | 11.00 | 14.80 | 16.00 |

| Value | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 3 | 4 | 4 | 5 | 4 | 2 | 2 | 3 | 3 | 5 | 3 | 3 |
| Proportion | 0.058 | 0.077 | 0.077 | 0.096 | 0.077 | 0.038 | 0.038 | 0.058 | 0.058 | 0.096 | 0.058 | 0.058 |

| Value | 12 | 13 | 15 | 16 | 19 | 25 |
|---|---|---|---|---|---|---|
| Frequency | 1 | 4 | 1 | 3 | 1 | 1 |
| Proportion | 0.019 | 0.077 | 0.019 | 0.058 | 0.019 | 0.019 |

For the frequency table, variable is rounded to the nearest 0

dg
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 52 | 0 | 2 | 0.679 | 34 | 0.6538 | 0.4615 |

exper
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 52 | 0 | 29 | 0.998 | 16.12 | 11.85 | 1.00 | 2.10 | 6.75 | 15.50 | 23.25 | 30.90 | 31.45 |

salary
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 52 | 0 | 51 | 1 | 23798 | 6755 | 16125 | 16519 | 18247 | 23719 | 27258 | 31903 | 34440 |

expcat
| n | missing | distinct | Info | Mean | Gmd |
|---|---|---|---|---|---|
| 52 | 0 | 7 | 0.972 | 3.654 | 2.327 |

| Value | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Frequency | 12 | 4 | 10 | 7 | 8 | 5 | 6 |
| Proportion | 0.231 | 0.077 | 0.192 | 0.135 | 0.154 | 0.096 | 0.115 |

For the frequency table, variable is rounded to the nearest 0
kable(description.pastecs)
| | gender | rank | yr | dg | exper | salary | expcat |
|---|---|---|---|---|---|---|---|
| nbr.val | NA | NA | 52.0000000 | 52.0000000 | 52.0000000 | 5.200000e+01 | 52.0000000 |
| nbr.null | NA | NA | 3.0000000 | 18.0000000 | 0.0000000 | 0.000000e+00 | 0.0000000 |
| nbr.na | NA | NA | 0.0000000 | 0.0000000 | 0.0000000 | 0.000000e+00 | 0.0000000 |
| min | NA | NA | 0.0000000 | 0.0000000 | 1.0000000 | 1.500000e+04 | 1.0000000 |
| max | NA | NA | 25.0000000 | 1.0000000 | 35.0000000 | 3.804500e+04 | 7.0000000 |
| range | NA | NA | 25.0000000 | 1.0000000 | 34.0000000 | 2.304500e+04 | 6.0000000 |
| sum | NA | NA | 389.0000000 | 34.0000000 | 838.0000000 | 1.237478e+06 | 190.0000000 |
| median | NA | NA | 7.0000000 | 1.0000000 | 15.5000000 | 2.371900e+04 | 3.5000000 |
| mean | NA | NA | 7.4807692 | 0.6538462 | 16.1153846 | 2.379765e+04 | 3.6538462 |
| SE.mean | NA | NA | 0.7637579 | 0.0666173 | 1.4175835 | 8.205804e+02 | 0.2812446 |
| CI.mean | NA | NA | 1.5333079 | 0.1337399 | 2.8459176 | 1.647384e+03 | 0.5646220 |
| var | NA | NA | 30.3329563 | 0.2307692 | 104.4962293 | 3.501431e+07 | 4.1131222 |
| std.dev | NA | NA | 5.5075363 | 0.4803845 | 10.2223397 | 5.917289e+03 | 2.0280834 |
| coef.var | NA | NA | 0.7362259 | 0.7347056 | 0.6343218 | 2.486501e-01 | 0.5550544 |
kable(description.psych)
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gender* | 1 | 52 | 1.730769e+00 | 0.4478876 | 2.0 | 1.785714e+00 | 0.0000 | 1 | 2 | 1 | -1.0106614 | -0.9966271 | 0.0621108 |
| rank* | 2 | 52 | 2.038461e+00 | 0.8623165 | 2.0 | 2.047619e+00 | 1.4826 | 1 | 3 | 2 | -0.0713405 | -1.6773420 | 0.1195818 |
| yr | 3 | 52 | 7.480769e+00 | 5.5075363 | 7.0 | 7.023809e+00 | 5.9304 | 0 | 25 | 25 | 0.7468534 | 0.3085015 | 0.7637579 |
| dg | 4 | 52 | 6.538462e-01 | 0.4803845 | 1.0 | 6.904762e-01 | 0.0000 | 0 | 1 | 1 | -0.6281951 | -1.6357249 | 0.0666173 |
| exper | 5 | 52 | 1.611538e+01 | 10.2223397 | 15.5 | 1.595238e+01 | 12.6021 | 1 | 35 | 34 | 0.0728612 | -1.2024045 | 1.4175835 |
| salary | 6 | 52 | 2.379765e+04 | 5917.2891544 | 23719.0 | 2.338926e+04 | 6643.5306 | 15000 | 38045 | 23045 | 0.4476630 | -0.6010913 | 820.5803638 |
| expcat | 7 | 52 | 3.653846e+00 | 2.0280834 | 3.5 | 3.571429e+00 | 2.2239 | 1 | 7 | 6 | 0.1475296 | -1.2300702 | 0.2812446 |
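The standard error (SE.mean, se) and the CI.mean value in the tables above can be reproduced by hand. The sketch below uses the salary sample size and standard deviation as printed in the output; treating CI.mean as the half-width of a 95% t-based confidence interval is an assumption based on the default settings of pastecs::stat.desc.

```r
n <- 52                      # number of observations, from the tables above
sd_salary <- 5917.2891544    # standard deviation of salary, from the psych table

# Standard error of the mean: sd / sqrt(n)
se_mean <- sd_salary / sqrt(n)

# Half-width of the 95% confidence interval of the mean (t distribution,
# n - 1 degrees of freedom); assumed to correspond to CI.mean
ci_half_width <- qt(0.975, df = n - 1) * se_mean

print(c(se_mean = se_mean, ci_half_width = ci_half_width))
```

Both values agree with the salary column of the pastecs table to the printed precision.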
Generate summary statistics grouped by gender. (1 mark)
Hint: use package psych
description.psych.by_gender <- psych::describeBy(salary, group=salary$gender)
# Use [[name]] so that kable receives each group's data frame, not a one-element list
render.description.psych.by_gender <- lapply(names(description.psych.by_gender),
function(name){
knitr::kable(description.psych.by_gender[[name]], caption = name)})
render.description.psych.by_gender
Load iris dataset into workspace.
Identify mean, median, range, 98th percentile of Petal.Length (1 mark)
petalLength.mean <- mean(iris$Petal.Length)
petalLength.median <- median(iris$Petal.Length)
petalLength.range <- range(iris$Petal.Length)
petalLength.98percentile <- quantile(iris$Petal.Length, 0.98)
print(paste('Mean Petal Length:', petalLength.mean))
## [1] "Mean Petal Length: 3.758"
print(paste('Median Petal Length:', petalLength.median))
## [1] "Median Petal Length: 4.35"
print(paste('Range Petal Length min:', petalLength.range[1], ' max:', petalLength.range[2]))
## [1] "Range Petal Length min: 1 max: 6.9"
print(paste('98% Percentile Petal Length:', petalLength.98percentile))
## [1] "98% Percentile Petal Length: 6.602"
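The 98th percentile is not one of the observed values because `quantile` interpolates between neighbouring order statistics. The sketch below reproduces the default method (type 7) by hand; the helper variable names are chosen for illustration.

```r
x <- sort(iris$Petal.Length)   # order statistics
n <- length(x)                 # 150 observations
p <- 0.98

h <- (n - 1) * p + 1           # fractional position in the sorted data
k <- floor(h)                  # index of the lower neighbour
manual_q98 <- x[k] + (h - k) * (x[k + 1] - x[k])   # linear interpolation

print(manual_q98)
print(unname(quantile(iris$Petal.Length, 0.98)))   # built-in result for comparison
```

With 150 observations the position is 147.02, so the result lies 2% of the way between the 147th and 148th sorted values.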
Draw the histogram for Sepal.Width, mention which measure of dispersion method suits the best? (1 mark)
The histogram shows a roughly bell-shaped curve, similar to a normal distribution. Since Sepal.Width is a continuous variable that looks approximately normal, the mean and the standard deviation describe it well. The standard deviation is preferable to the variance because it is expressed in the same units as the variable and is therefore easier to interpret.
hist(iris$Sepal.Width, main = 'Histogram of Iris Sepal Width', xlab = 'Iris Sepal Width')
sepalWidth.range <- range(iris$Sepal.Width)
sepalWidth.variance <- var(iris$Sepal.Width)
sepalWidth.sd <- sd(iris$Sepal.Width)
sepalWidth.iqr <- IQR(iris$Sepal.Width)
# Print the measures of dispersion
print(paste("Range of Sepal Width: [", sepalWidth.range[1], ',', sepalWidth.range[2], ']'))
## [1] "Range of Sepal Width: [ 2 , 4.4 ]"
print(paste("Variance of Sepal Width:", sepalWidth.variance))
## [1] "Variance of Sepal Width: 0.189979418344519"
print(paste("Standard Deviation of Sepal Width:", sepalWidth.sd))
## [1] "Standard Deviation of Sepal Width: 0.435866284936698"
print(paste("Interquartile Range of Sepal Width:", sepalWidth.iqr))
## [1] "Interquartile Range of Sepal Width: 0.5"
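The visual impression of a bell shape can be complemented with a formal check. The sketch below uses the Shapiro-Wilk test from base R; choosing this particular test is an assumption made here for illustration and was not part of the assignment.

```r
# Shapiro-Wilk normality test on Sepal.Width
# (null hypothesis: the data come from a normal distribution)
sepal_width_normality <- shapiro.test(iris$Sepal.Width)

print(sepal_width_normality$statistic)   # W statistic
print(sepal_width_normality$p.value)     # small values suggest non-normality
```

A p-value above the chosen significance level means the test finds no evidence against normality, which would support using the mean and standard deviation.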
Load HairEyeColor dataset into workspace.
Hint: dataHairEye <- as.data.frame(HairEyeColor)
As a customer, I would like to know the total number of people with each combination of hair and eye color. Which chart suits this task best? Plot the same. (1 mark)
For this dataset we are counting the occurrences of two categorical variables, so we need a chart that shows both variables together and how often each combination occurs.
I think that a bubble chart, built with geom_point sized by frequency and labelled with geom_text, is the chart that suits best.
if ('ggplot2' %in% installed.packages() == FALSE) {
install.packages('ggplot2', repos = 'http://cran.us.r-project.org')
}
library(ggplot2, warn.conflicts = FALSE)
data(HairEyeColor)
dataHairEye <- as.data.frame(HairEyeColor)
dataHairEye.aggregated <- aggregate(Freq ~ Hair + Eye, data = dataHairEye, FUN = sum)
ggplot(data = dataHairEye.aggregated, aes(x = Hair, y = Eye, size = Freq)) +
geom_point(color = "black") +
scale_size_continuous(range = c(5, 30), guide = "none") +
ggtitle("Hair and Eye Color Combinations") +
geom_text(aes(label = Freq), size = 4, color="white") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
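The totals behind the chart can also be inspected as a plain contingency table. This is a minimal sketch using `xtabs` from base R as an alternative to `aggregate`; the object name `hair_eye_totals` is chosen for illustration.

```r
dataHairEye <- as.data.frame(HairEyeColor)

# Sum the frequencies over Sex to get one count per hair/eye combination
hair_eye_totals <- xtabs(Freq ~ Hair + Eye, data = dataHairEye)

print(hair_eye_totals)
print(sum(hair_eye_totals))   # total number of people in the dataset
```

Each cell of the table corresponds to one labelled bubble in the plot.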
A meteorologist wants to compare the annual average rainfall between two cities for the past 20 years. Which plot is most suitable? Plot the graph by generating 20 random data points between 0 and 28 for Dublin and Cork. (2 marks)
For a dataset of 20 data points per city, where the aim is to compare the two cities, I think that an area chart is the best choice.
A bar plot could be confusing here because it would show 40 bars, one per year and per city.
An area chart makes it easier to compare Dublin and Cork and to see how their rainfall changes over time.
if ('tidyr' %in% installed.packages() == FALSE) {
install.packages('tidyr', repos = 'http://cran.us.r-project.org')
}
library(tidyr, warn.conflicts = FALSE)
set.seed(42) # arbitrary fixed seed so that the random data is reproducible
current_year <- as.numeric(format(Sys.Date(), "%Y"))
rain_data <- data.frame(Year = (current_year - 20):(current_year - 1),
Cork = runif(20, 0, 28),
Dublin = runif(20, 0, 28))
# Reshape to long format: one row per year and city
df_rain_data <- gather(rain_data, City, Rain, c(Dublin,Cork))
ggplot(data=df_rain_data, aes(x = Year, fill = City)) +
# position = "identity" overlays the two areas instead of stacking them
geom_area(aes(y = Rain), position = "identity", alpha=0.8) +
ylab("Average Rain") +
ggtitle("Average rain per Year in Dublin and Cork") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Load the provided world-small.csv file. (2 marks)
df_world_small <- read.csv("data/world-small.csv", header = TRUE)
ggplot(df_world_small, aes(x = gdppcap08, fill = region)) +
geom_histogram(binwidth = 1000) +
labs(title = "GDP per Capita in 2008", x = "GDP per Capita", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
To get a good understanding, I decided to start with a chart per region and then create one per country, which shows more data points.
ggplot(df_world_small, aes(y = polityIV, x = region)) +
geom_boxplot() +
labs(title = "Polity IV Per Region Scores Chart", x = "Region", y = "Polity IV Score") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
We can also draw the chart per country, but it has too much detail to read comfortably. So I decided to install the plotly package to get an interactive visualization. I also decided to reorder the x-axis by polityIV to simplify the visualization.
if ('plotly' %in% installed.packages() == FALSE) {
install.packages('plotly', repos = 'http://cran.us.r-project.org')
}
library(plotly, warn.conflicts = FALSE)
pl <- ggplot(df_world_small, aes(x = reorder(country, polityIV), y = polityIV)) +
geom_boxplot() +
labs(title = "Polity IV Per Country Scores Chart", x = "Country", y = "Polity IV Score") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(pl)
From the chart in (i) we can say that the region with the biggest GDP per capita is “Middle East”. We can confirm this by finding the region of the row with the maximum gdppcap08 in the data.
region_biggest_gdpcap08 <- df_world_small[which.max(df_world_small$gdppcap08), "region"]
print(region_biggest_gdpcap08)
## [1] "Middle East"
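A per-region maximum can also be computed explicitly with `aggregate`. The sketch below uses a small hypothetical data frame, since the contents of world-small.csv are not reproduced here; only the column names `region` and `gdppcap08` follow the source.

```r
# Hypothetical toy rows, NOT taken from world-small.csv
toy_world <- data.frame(
  region = c("A", "A", "B", "B"),
  gdppcap08 = c(1000, 5000, 3000, 2000)
)

# Maximum GDP per capita within each region
max_by_region <- aggregate(gdppcap08 ~ region, data = toy_world, FUN = max)

# Region whose maximum is the largest overall
top_region <- max_by_region$region[which.max(max_by_region$gdppcap08)]

print(max_by_region)
print(as.character(top_region))
```

Applied to `df_world_small`, the same two steps would list the maximum for every region rather than only the overall winner.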
From the chart in (ii), the Polity IV per Country Scores chart, if we zoom in at the beginning of the x-axis we can tell that the countries with the lowest polityIV are Qatar and Saudi Arabia.
countries_with_min_polityiv <- df_world_small[df_world_small$polityIV == min(df_world_small$polityIV), "country"]
print(countries_with_min_polityiv)
## [1] "Qatar" "Saudi Arabia"
Table 1 represents people in Dublin who like to own certain types of pets. (2 marks)
Table 1: Pet Lovers
| Pet | Number of people |
|---|---|
| Dogs | 2034 |
| Cats | 492 |
| Fish | 785 |
| Macaw | 298 |
pets_text <- "Pet Number_of_people
Dogs 2034
Cats 492
Fish 785
Macaw 298"
df_pets <- read.table(text = pets_text, header = TRUE)
ggplot(data = df_pets, aes(x = Pet, y = Number_of_people)) +
geom_bar(stat = "identity") +
labs(title = "Pet Lovers Bar Chart", x = "Pet", y = "Number of People") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Looking at the pie chart of pets, I find it challenging to discern and compare the number of individuals who favor each type of pet. In my opinion, a pie chart is more effective for representing proportions rather than absolute values.
ggplot(data = df_pets, aes(x ="" , y = Number_of_people, fill = Pet)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start=0) +
labs(title = "Pet Lovers Pie Chart") +
theme_void() +
theme(plot.title = element_text(hjust = 0.5))
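The proportions that a pie chart encodes can be computed directly from Table 1. This is a minimal sketch using `prop.table` from base R; the vector name `pet_counts` is chosen for illustration.

```r
# Counts from Table 1: Pet Lovers
pet_counts <- c(Dogs = 2034, Cats = 492, Fish = 785, Macaw = 298)

# Share of each pet among all respondents
pet_proportions <- prop.table(pet_counts)

print(sum(pet_counts))
print(round(pet_proportions, 3))
```

Printing the shares next to the chart makes the comparison between slices easier than reading angles alone.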